data commons
Community search signatures as foundation features for human-centered geospatial modeling
Sun, Mimi, Kamath, Chaitanya, Agarwal, Mohit, Muslim, Arbaaz, Yee, Hector, Schottlander, David, Bavadekar, Shailesh, Efron, Niv, Shetty, Shravya, Prasad, Gautam
Aggregated relative search frequencies offer a unique composite signal reflecting people's habits, concerns, interests, intents, and general information needs, which are not found in other readily available datasets. Temporal search trends have been successfully used in time series modeling across a variety of domains such as infectious diseases, unemployment rates, and retail sales. However, most existing applications require curating specialized datasets of individual keywords, queries, or query clusters, and the search data need to be temporally aligned with the outcome variable of interest. We propose a novel approach for generating an aggregated and anonymized representation of search interest as foundation features at the community level for geospatial modeling. We benchmark these features using spatial datasets across multiple domains. In zip codes with a population greater than 3,000 that cover over 95% of the contiguous US population, our models for predicting missing values in a 20% holdout set of counties achieve an average $R^2$ score of 0.74 across 21 health variables, and 0.80 across 6 demographic and environmental variables. Our results demonstrate that these search features can be used for spatial predictions without strict temporal alignment, and that the resulting models outperform spatial interpolation and state-of-the-art methods using satellite imagery features.
- Europe > Austria > Vienna (0.14)
- North America > United States > Texas > Harris County (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > United Kingdom (0.04)
- Research Report > Promising Solution (0.54)
- Research Report > New Finding (0.54)
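The county-holdout evaluation described in the Sun et al. abstract above can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: the embedding features, county assignments, outcome variable, and the ridge regressor are assumptions for illustration, since the paper's actual features and models are not part of this digest.

```python
# Minimal sketch: predict a community-level outcome from search-signature
# features, scoring R^2 on a 20% county-level holdout. All data is synthetic.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_zips, dim = 5000, 128
X = rng.normal(size=(n_zips, dim))           # search-signature embedding per zip code
county = rng.integers(0, 300, size=n_zips)   # county id for each zip code
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=n_zips)  # stand-in health variable

# Hold out 20% of counties (not zip codes) so train and test are spatially disjoint.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=county))

model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
print("holdout R^2:", r2_score(y[test_idx], model.predict(X[test_idx])))
```

Grouping the split by county rather than by zip code is the key detail: it prevents a model from scoring well merely by memorizing neighboring zip codes in the same county.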
Google's new tool lets large language models fact-check their responses
The first of the two methods is called Retrieval-Interleaved Generation (RIG), which acts as a sort of fact-checker. If a user prompts the model with a question--like "Has the use of renewable energy sources increased in the world?"--the model will come up with a "first draft" answer. Then RIG identifies what portions of the draft answer could be checked against Google's Data Commons, a massive repository of data and statistics from reliable sources like the United Nations or the Centers for Disease Control and Prevention. Next, it runs those checks and replaces any incorrect original guesses with correct facts. It also cites its sources to the user.
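The RIG loop the article describes (draft, identify checkable values, query Data Commons, substitute) can be sketched as below. The bracketed claim markup and the substitution logic are hypothetical illustrations, not Google's implementation; `get_stat_value` is the public Data Commons Python API call for fetching a statistic.

```python
# Illustrative RIG-style fact-check loop. The [DC(place, stat_var) -> guess]
# markup is an assumed convention for marking checkable spans in a draft.
import re
import datacommons as dc  # pip install datacommons

def rig_check(draft: str) -> str:
    pattern = r"\[DC\((\S+),\s*(\S+)\)\s*->\s*([^\]]+)\]"
    def substitute(match: re.Match) -> str:
        place, stat_var, guess = match.groups()
        try:
            value = dc.get_stat_value(place, stat_var)  # latest observation
            return str(value)        # replace the model's guess with the fact
        except Exception:
            return guess.strip()     # keep the draft value if the lookup fails
    return re.sub(pattern, substitute, draft)

draft = "California has [DC(geoId/06, Count_Person) -> about 39 million] residents."
print(rig_check(draft))
```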
Knowing When to Ask -- Bridging Large Language Models and Data
Radhakrishnan, Prashanth, Chen, Jennifer, Xu, Bo, Ramaswami, Prem, Pho, Hannah, Olmos, Adriana, Manyika, James, Guha, R. V.
Large Language Models (LLMs) are prone to generating factually incorrect information when responding to queries that involve numerical and statistical data or other timely facts. In this paper, we present an approach for enhancing the accuracy of LLMs by integrating them with Data Commons, a vast, open-source repository of public statistics from trusted organizations like the United Nations (UN), the Centers for Disease Control and Prevention (CDC), and global census bureaus. We explore two primary methods: Retrieval Interleaved Generation (RIG), where the LLM is trained to produce natural language queries to retrieve data from Data Commons, and Retrieval Augmented Generation (RAG), where relevant data tables are fetched from Data Commons and used to augment the LLM's prompt. We evaluate these methods on a diverse set of queries, demonstrating their effectiveness in improving the factual accuracy of LLM outputs. Our work represents an early step towards building more trustworthy and reliable LLMs that are grounded in verifiable statistical data and capable of complex factual reasoning.
- North America > United States > California > San Francisco County > San Francisco (0.30)
- North America > United States > California > Santa Clara County > Mountain View (0.14)
- North America > United States > California > Sonoma County (0.05)
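The RAG path in the abstract above can be sketched with the public `datacommons-pandas` client: fetch a small statistics table and prepend it to the LLM's prompt. The prompt template, the statistical variable id, and the commented-out LLM call are assumptions for illustration, not the paper's code.

```python
# Sketch of Data Commons RAG: retrieve a time-series table, then augment
# the prompt with it before calling an LLM.
import datacommons_pandas as dcp  # pip install datacommons-pandas

def build_prompt(question: str, places: list[str], stat_var: str) -> str:
    # Rows are places, columns are observation dates.
    table = dcp.build_time_series_dataframe(places, stat_var)
    return (
        "Use only the table below to answer.\n\n"
        f"{table.to_csv()}\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "How has renewable energy consumption changed?",
    ["country/USA"],                               # real Data Commons place DCID
    "Amount_Consumption_Energy_RenewableSources",  # hypothetical stat var id
)
# response = call_llm(prompt)  # any LLM client goes here
print(prompt[:400])
```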
Building Flexible, Scalable, and Machine Learning-ready Multimodal Oncology Datasets
Tripathi, Aakash, Waqas, Asim, Venkatesan, Kavya, Yilmaz, Yasin, Rasool, Ghulam
The advancements in data acquisition, storage, and processing techniques have resulted in the rapid growth of heterogeneous medical data. Integrating radiological scans, histopathology images, and molecular information with clinical data is essential for developing a holistic understanding of the disease and optimizing treatment. The need to integrate data from multiple sources is further pronounced in complex diseases such as cancer, where it enables precision medicine and personalized treatments. This work proposes the Multimodal Integration of Oncology Data System (MINDS), a flexible, scalable, and cost-effective metadata framework for efficiently fusing disparate data from public sources such as the Cancer Research Data Commons (CRDC) into an interconnected, patient-centric framework. MINDS offers an interface for exploring relationships across data types and building cohorts for developing large-scale multimodal machine learning models. By harmonizing multimodal data, MINDS aims to empower researchers with greater analytical ability to uncover diagnostic and prognostic insights and enable evidence-based personalized care. MINDS tracks granular end-to-end data provenance, ensuring reproducibility and transparency. Its cloud-native architecture can handle exponential data growth in a secure, cost-optimized manner while ensuring substantial storage optimization, replication avoidance, and dynamic access capabilities. Auto-scaling, access controls, and other mechanisms guarantee the scalability and security of its pipelines. MINDS overcomes the limitations of existing biomedical data silos via an interoperable metadata-driven approach that represents a pivotal step toward the future of oncology data integration.
- North America > United States > Florida (0.04)
- North America > United States > Oregon (0.04)
- North America > United States > California > Santa Cruz County > Santa Cruz (0.04)
- Europe > Malta > Northern Region > Western District > Attard (0.04)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Diagnostic Medicine (1.00)
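The patient-centric metadata idea behind MINDS can be illustrated with a toy schema: one row per data object, keyed by patient, so cohort building reduces to a query over metadata while raw files stay in place. The table layout, field names, and URIs below are illustrative assumptions, not the MINDS schema.

```python
# Toy metadata store: cohort discovery becomes a SQL query over object
# metadata, with URIs pointing at the raw multimodal files.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE objects (
        patient_id TEXT,
        modality   TEXT,   -- e.g. 'radiology', 'histopathology', 'clinical'
        source     TEXT,   -- e.g. 'CRDC'
        uri        TEXT    -- pointer to the raw file; data stays in place
    )
""")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?, ?)",
    [
        ("p1", "radiology", "CRDC", "s3://bucket/p1/ct.dcm"),
        ("p1", "clinical", "CRDC", "s3://bucket/p1/notes.json"),
        ("p2", "histopathology", "CRDC", "s3://bucket/p2/slide.svs"),
    ],
)
# Cohort: patients that have both imaging and clinical records.
cohort = conn.execute("""
    SELECT patient_id FROM objects
    GROUP BY patient_id
    HAVING SUM(modality = 'radiology') > 0 AND SUM(modality = 'clinical') > 0
""").fetchall()
print(cohort)  # [('p1',)]
```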
Text2Cohort: Facilitating Intuitive Access to Biomedical Data with Natural Language Cohort Discovery
Kulkarni, Pranav, Kanhere, Adway, Yi, Paul H., Parekh, Vishwa S.
The Imaging Data Commons (IDC) is a cloud-based database that provides researchers with open access to cancer imaging data, with the goal of facilitating collaboration. However, cohort discovery within the IDC database has a significant technical learning curve. Recently, large language models (LLMs) have demonstrated exceptional utility for natural language processing tasks. We developed Text2Cohort, an LLM-powered toolkit to facilitate user-friendly natural language cohort discovery in the IDC. Our method translates user input into IDC queries using grounding techniques and returns the query's response. We evaluate Text2Cohort on 50 natural language inputs, ranging from information extraction to cohort discovery. Our toolkit successfully generated responses with 88% accuracy and a 0.94 F1 score. We demonstrate that Text2Cohort can enable researchers to discover and curate cohorts on the IDC with high accuracy using natural language in a more intuitive and user-friendly way.
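The translate-then-query pattern this abstract describes can be sketched as follows. The prompt wording and the `generate_sql` stub are assumptions, not the Text2Cohort implementation; IDC metadata is published in BigQuery, and the `dicom_all` table is used here only as an example target.

```python
# Illustrative sketch: ground a natural-language request in the IDC schema
# via an LLM prompt, then run the generated SQL against BigQuery.
SCHEMA_HINT = (
    "Table `bigquery-public-data.idc_current.dicom_all` with columns: "
    "PatientID, Modality, collection_id, BodyPartExamined"
)

def generate_sql(prompt: str) -> str:
    # Stub standing in for an LLM completion call; returns a canned query
    # so the sketch runs without credentials. Swap in any real LLM client.
    return (
        "SELECT DISTINCT PatientID "
        "FROM `bigquery-public-data.idc_current.dicom_all` "
        "WHERE Modality = 'MR' AND collection_id LIKE '%lung%'"
    )

def nl_to_query(request: str) -> str:
    prompt = (
        f"{SCHEMA_HINT}\n"
        f"Write one BigQuery SQL query for: {request}\n"
        "Return only SQL."
    )
    return generate_sql(prompt)

print(nl_to_query("all MRI patients from lung cancer collections"))
```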
Data Commons
Guha, Ramanathan V., Radhakrishnan, Prashanth, Xu, Bo, Sun, Wei, Au, Carolyn, Tirumali, Ajai, Amjad, Muhammad J., Piekos, Samantha, Diaz, Natalie, Chen, Jennifer, Wu, Julia, Ramaswami, Prem, Manyika, James
Publicly available data from open sources (e.g., the United States Census Bureau (Census), the World Health Organization (WHO), the Intergovernmental Panel on Climate Change (IPCC)) are vital resources for policy makers, students, and researchers across different disciplines. Combining data from different sources requires the user to reconcile differences in schemas, formats, assumptions, and more. This data wrangling is time consuming, tedious, and needs to be repeated by every user of the data. Our goal with Data Commons (DC) is to help make public data accessible and useful to those who want to understand it and use it to address societal challenges and opportunities. We do the data processing and make the processed data widely available via standard schemas and Cloud APIs. Data Commons is a distributed network of sites that publish data in a common schema and interoperate using the Data Commons APIs. Data from different Data Commons can be joined easily. The aggregate of these Data Commons can be viewed as a single Knowledge Graph, which can then be queried with natural language questions by leveraging advances in Large Language Models. This paper describes the architecture of Data Commons and some of its major deployments, and highlights directions for future work.
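The common-schema, single-Knowledge-Graph view described here is what the public Data Commons Python API exposes. A minimal sketch, assuming the `datacommons` package is installed (an API key may be required for some endpoints) and that the chosen statistic is available for these places:

```python
# Traverse the Data Commons Knowledge Graph, then read statistics from it.
import datacommons as dc  # pip install datacommons

# Graph traversal: counties contained in California (DCID geoId/06).
counties = dc.get_places_in(["geoId/06"], "County")["geoId/06"]

# Statistics over the same graph: latest population for a few counties.
for county in counties[:3]:
    name = dc.get_property_values([county], "name")[county][0]
    population = dc.get_stat_value(county, "Count_Person")
    print(name, population)
```

The point of the example is that place hierarchy and statistics live in one graph with one id space, so no schema reconciliation is needed between the traversal and the lookup.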
AI and the tyranny of the data commons
I am here to tell you the sad but true story of the demise of the sharing economy. Remember how we were told, back in the 1990s and 2000s, that we were contributing to the creation of the largest commons known to humanity? Well, to paraphrase The Lord of the Rings, we were all of us deceived, for another ring was made. Artificial intelligence (AI) is making that clearer than ever. The free data we generated by spending thousands of hours on Big Tech's platforms has been appropriated and converted into training data for AI models.
- North America > United States (0.05)
- Europe > Switzerland (0.05)
- Europe > Spain (0.05)
The Price of Your AI-Generated Selfie
The recent flooding of social media feeds with AI-generated "portraits" derived from databases of artists' work has renewed conversation over data ownership and the potential power AI has to supplant livelihoods in the future. The 22 million individuals and counting who have already handed over their images to the Lensa application might be fine to receive the myriad AI-illustrated images in exchange for their data. But the fundamental rights, principles, and freedoms users are giving up during this exchange remain largely unchecked. In Web3 technology circles, many promises have been made about decentralized technologies opening up the possibility of individual ownership and monetization of data, returning power to "creators." This reflects the political ethos held by Blockchain proponents like Ethereum co-founder Joe Lubin, who ostensibly seek to supplant the existing power structures of finance through "permissionless" consensus-based transaction data structures.
- Asia > India (0.15)
- North America > United States > Virginia (0.05)
- North America > United States > Utah (0.05)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (0.97)
Building a better data economy
It's "time to wake up and do a better job," says publisher Tim O'Reilly--from getting serious about climate change to building a better data economy. And the way a better data economy is built is through data commons--or data as a common resource--not as the giant tech companies are acting now, which is not just keeping data to themselves but profiting from our data and causing us harm in the process. "When companies are using the data they collect for our benefit, it's a great deal," says O'Reilly, founder and CEO of O'Reilly Media. "When companies are using it to manipulate us, or to direct us in a way that hurts us, or that enhances their market power at the expense of competitors who might provide us better value, then they're harming us with our data." And that's the next big thing he's researching: a specific type of harm that happens when tech companies use data against us to shape what we see, hear, and believe. It's what O'Reilly calls "algorithmic rents," which uses data, algorithms, and user interface design as a way of controlling who gets what information and why. Unfortunately, one only has to look at the news to see the rapid spread of misinformation on the internet tied to unrest in countries across the world. We can ask who profits, but perhaps the better question is "who suffers?" According to O'Reilly, "If you build an economy where you're taking more out of the system than you're putting back or that you're creating, then guess what, you're not long for this world." That really matters because users of this technology need to stop thinking about the worth of individual data and what it means when very few companies control that data, even when it's more valuable in the open. After all, there are "consequences of not creating enough value for others." We're now approaching a different idea: what if it's actually time to start rethinking capitalism as a whole? "It's a really great time for us to be talking about how do we want to change capitalism, because we change it every 30, 40 years," O'Reilly says. He clarifies that this is not about abolishing capitalism, but what we have isn't good enough anymore. "We actually have to do better, and we can do better. And to me better is defined by increasing prosperity for everyone."
- North America > United States > California (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Germany (0.04)
- Law (1.00)
- Banking & Finance (1.00)
- Information Technology > Services (0.68)
- Government > Regional Government > North America Government > United States Government (0.46)
Google launches a suite of tech-powered tools for reporters, Journalist Studio – TechCrunch
Google is putting AI and machine learning technologies into the hands of journalists. The company this morning announced a suite of new tools, Journalist Studio, that will allow reporters to do their work more easily. At launch, the suite includes a host of existing tools as well as two new products aimed at helping reporters search across large documents and visualize data. The first tool is called Pinpoint and is designed to help reporters work with large file sets -- like those that contain hundreds of thousands of documents. Pinpoint will work as an alternative to using the "Ctrl+F" function to manually seek out specific keywords in the documents.
- North America > United States (0.17)
- North America > Mexico (0.05)
- Asia > Philippines (0.05)